Abstract

Background: Genetic relatedness or similarity between individuals is a key concept in population, quantitative and 
conservation genetics. When the pedigree of a population is available and assuming a founder population from 
which the genealogical records start, genetic relatedness between individuals can be estimated by the coancestry 
coefficient. If pedigree data is lacking or incomplete, estimation of the genetic similarity between individuals relies 
on molecular markers, using either molecular coancestry or molecular covariance. Some relationships between 
genealogical and molecular coancestries and covariances have already been described in the literature. 

Methods: We show how the expected values of the empirical measures of similarity based on molecular marker 
data are functions of the genealogical coancestry. From these formulas, it is easy to derive estimators of 
genealogical coancestry from molecular data. We include variation of allelic frequencies in the estimators. 

Results: The estimators are illustrated with simulated examples and with a real dataset from dairy cattle. In general, 
estimators are accurate and only slightly biased. From the real data set, estimators based on covariances are more 
compatible with genealogical coancestries than those based on molecular coancestries. A frequently used 
estimator based on the average of estimated coancestries produced inflated coancestries and numerical instability. 
The consequences of unknown gene frequencies in the founder population are briefly discussed, along with 
alternatives to overcome this limitation. 

Conclusions: Estimators of genealogical coancestry based on molecular data are easy to derive. Estimators based 
on molecular covariance are more accurate than those based on identity by state. A correction considering the 
random distribution of allelic frequencies improves accuracy of these estimators, especially for populations with 
very strong drift. 



Background 

The concept of coancestry (or kinship) between two indi- 
viduals plays a central role in practical applications of 
genetics. In animal breeding, coancestry coefficients are 
required both to estimate genetic parameters and to carry 
out genetic evaluations [1]. In sociobiology, they are 
important to make evolutionary interpretations of social 
behavior and to determine parameters of the biology of 
reproduction. In the field of animal conservation, they 
constitute fundamental tools to estimate inbreeding 
depression and to optimize genetic management in a con- 
servation program. Several estimators of coancestries 
based on molecular information have been proposed, 
including recent estimators that are designed to deal with 
a large number of markers [2-7]. These estimators are
based on intuitive basic identities that were explicitly 
shown by Cockerham [8] (and also [9]), namely, that 
resemblance between genotypes is a function of coancestry 
(identity by descent) and allelic frequencies at the base 
population. Interest in this subject has grown with the use 
of dense marker data. However, this body of literature is 
poorly known in the human and animal genetics commu- 
nities. The aim of this work is to build estimators of gen- 
ealogical coancestry from molecular coancestries and 
molecular covariances and to illustrate their behavior 
based on simulations and a real data set. 

Methods 

In the following sections, $g_{ik}$ refers to the gene frequency
value for genotypes AA, Aa and aa, coded as 1, 1/2 and
0, respectively, of individual i at locus k, where i = 1, ..., n
and k = 1, ..., L. Gene frequency is half the gene content
(number of copies of the reference allele A). Two animals
will be referred to by indexes i and j, and two loci
by k and l. Allelic frequency in the base population is
denoted by p. Loci will be assumed to be neutral.

Genealogical coancestry 

In both population and quantitative genetics, the genetic 
relationship between individuals can be quantified by 
Malecot's coefficient of coancestry (or kinship) [10]. The
coancestry coefficient, $f_{ij}$, between individuals i and j is
defined as the probability that, at a random, neutral,
autosomal locus, an allele drawn randomly from individual
i is identical by descent (IBD) to an allele drawn randomly
from individual j. The inbreeding coefficient of an
individual i, $F_i$, is defined as the probability that the two
alleles carried by this individual at a given locus are IBD.
The inbreeding of an individual equals the coancestry
between its parents. Finally, the self-coancestry $f_{ii}$ of an
individual equals $\frac{1}{2}(1 + F_i)$. These coefficients can be
estimated from pedigrees using the tabular method [11].
For diploid individuals, twice the coancestry coefficient is 
the additive relationship coefficient, which describes the 
ratio between the genetic covariance between individuals 
and the genetic variance of the base population. 
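The recursion behind the tabular method can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this work; the function name `tabular_coancestry` and the pedigree encoding (a list of parent index pairs, with `None` for unknown parents and parents listed before offspring) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Pedigree = List[Tuple[Optional[int], Optional[int]]]


def tabular_coancestry(pedigree: Pedigree) -> List[List[float]]:
    """Coancestry matrix computed with the tabular method.

    pedigree[i] = (sire, dam), given as indexes of earlier individuals,
    or None for an unknown parent (assumed unrelated and non-inbred).
    """
    n = len(pedigree)
    f = [[0.0] * n for _ in range(n)]
    for i, (s, d) in enumerate(pedigree):
        # Coancestry of i with an earlier individual j is the average of
        # the coancestries of j with the two parents of i.
        for j in range(i):
            fs = f[j][s] if s is not None else 0.0
            fd = f[j][d] if d is not None else 0.0
            f[i][j] = f[j][i] = 0.5 * (fs + fd)
        # Self-coancestry is 1/2 (1 + F_i), where F_i is the coancestry
        # between the parents of i.
        inbreeding = f[s][d] if s is not None and d is not None else 0.0
        f[i][i] = 0.5 * (1.0 + inbreeding)
    return f


# Individuals 0 and 1 are unrelated founders, 2 and 3 are full sibs,
# and 4 is the offspring of the full-sib mating 2 x 3.
pedigree = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
f = tabular_coancestry(pedigree)
print(f[2][3])  # coancestry between the full sibs
print(f[4][4])  # self-coancestry of the inbred offspring
```

In this toy pedigree the full sibs have coancestry 1/4, so the inbreeding coefficient of their offspring is 1/4 and its self-coancestry is 1/2(1 + 1/4) = 0.625.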

Molecular coancestry 

If n individuals have been genotyped for one molecular
marker, the molecular coancestry (or kinship), $f_{Mij}$,
between individuals i and j is the probability that two
alleles at the locus, taken at random from each individual,
are equal (identical by state, IBS). The coancestry
concept includes the self-coancestry of an individual
with itself, $f_{Mii}$, in which case two alleles are drawn
with replacement within the individual. Analogously, $F_{Mi}$ is
the molecular inbreeding of individual i, i.e. the probability
that the two alleles carried by this individual at a
given locus are IBS.

By definition, $f_{Mii} = \frac{1}{2}(1 + F_{Mi})$. Molecular coancestry
between individuals i and j can be calculated at a given
locus k as:

$$g_{ik} g_{jk} + (1 - g_{ik})(1 - g_{jk})$$

and averaged across loci as:

$$f_{Mij} = \frac{1}{L} \sum_k \left[ g_{ik} g_{jk} + (1 - g_{ik})(1 - g_{jk}) \right] \qquad (1)$$

although other alternatives could be considered [7].

Molecular (co)variance of gene frequencies 

If a set of individuals has been genotyped for several
loci, we can calculate, for each individual, the variance
of the gene frequencies across loci and, for each pair of
individuals, the covariance between two individuals, also
across loci. The covariance between individuals i and j
can be calculated as:

$$\mathrm{Cov}_{Mij} = \mathrm{Cov}(g_i, g_j) = \frac{1}{L} \sum_k (g_{ik} - \bar{g}_i)(g_{jk} - \bar{g}_j) \qquad (2)$$

where $\bar{g}_i = \frac{1}{L} \sum_k g_{ik}$ and L is the number of loci.

It is important to emphasize that both molecular coan- 
cestry and molecular covariance are empirical measures of 
genetic similarity, and do not rely on any assumption 
about how the genotypes were generated. Notice that in
this definition $\mathrm{Cov}_M$ has to be computed over one or two
individuals at a time and across loci. Therefore, it can be
applied to one individual, or to individuals from different 
populations. Loci are considered as exchangeable (in the 
statistical sense), similar to how loci are treated in the con- 
text of gene dropping analysis where, instead of averaging 
the results over loci we can, equivalently, start the gene 
dropping analysis with just one locus and average over 
many replicates [12]. 
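Both empirical measures are simple averages across loci, as the following sketch of Equations (1) and (2) suggests. The function names and the toy genotype vectors are hypothetical and serve only as illustration; genotypes are coded 1, 1/2 and 0 as described above.

```python
from typing import Sequence


def molecular_coancestry(gi: Sequence[float], gj: Sequence[float]) -> float:
    """Equation (1): probability of identity by state, averaged across loci."""
    n_loci = len(gi)
    return sum(a * b + (1 - a) * (1 - b) for a, b in zip(gi, gj)) / n_loci


def molecular_covariance(gi: Sequence[float], gj: Sequence[float]) -> float:
    """Equation (2): covariance of gene frequencies across loci (divisor L)."""
    n_loci = len(gi)
    mean_i = sum(gi) / n_loci  # mean gene frequency of individual i across loci
    mean_j = sum(gj) / n_loci
    return sum((a - mean_i) * (b - mean_j) for a, b in zip(gi, gj)) / n_loci


# Toy genotypes at four loci: 1 (AA), 0.5 (Aa), 0 (aa).
g1 = [1.0, 0.5, 0.0, 0.5]
g2 = [1.0, 0.0, 0.0, 0.5]

f_m12 = molecular_coancestry(g1, g2)
cov_m12 = molecular_covariance(g1, g2)
f_m11 = molecular_coancestry(g1, g1)  # self-coancestry of individual 1
print(f_m12, cov_m12, f_m11)
```

Note that each individual is centered on its own mean across loci, so no population allele frequency enters either calculation.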

Relationships between genealogical and molecular 
coancestry and molecular covariance 

Here, we provide an intuitive explanation of Cockerham's 
[8] derivation. If the individuals are genealogically con- 
nected, the genealogical coancestry can also be defined as 
the molecular coancestry for 'virtual' alleles at loci that 
are all different in the founder population. For instance, 
in the gene dropping analysis [4], we start with a founder 
population where n founders have many independent 
loci, each with 2n different alleles present in the founder 
population. If we then calculate the molecular coancestry 
of each pair of individuals and average over many loci, 
we recover precisely the same coancestry values as those 
calculated by, for example, the tabular or path coefficient 
methods. 

Let us imagine now, that to each one of the 2n alleles at 
a locus in the base population, we assign a tag at random 
that indicates whether the allele is A or a with probability 
p and q = 1-p (because this assignment has been done at 
random, the genotypic frequencies AA, Aa and aa will be 
in Hardy-Weinberg equilibrium). For this locus, the
molecular coancestry between two individuals will be the 
probability that two alleles, taken at random from each 
individual have the same tag (thus are IBS). This could 
happen in two ways: either because they have become 
IBD as genealogy progresses (i.e. they are copies of the 
same unique allele from the base population, with probability
$f_{ij}$), or because they are not IBD (with probability
$1 - f_{ij}$) but have the same tag in the base population (with
probability $p^2 + q^2$). Therefore, on expectation,



$$E(f_{Mij}) = f_{ij} + (1 - f_{ij})(p^2 + q^2) \qquad (3)$$



This expression can be obtained from Equation (6) in 
reference [8] by summing the two events of IBS (A = A 
or a = a), weighted by probabilities p and q. The rela- 
tionship between genealogical coancestry and molecular 
covariance, shown in [8], is also known from standard 
population genetics (e.g. [1]). Briefly, two gametes covary
(are identical) with a probability (and correlation)
$f_{ij}$ and thus, the covariance of the gene frequency of two
individuals across loci (replicates) is (assuming the same
p for all loci):



$$E(\mathrm{Cov}_{Mij}) = pq f_{ij}. \qquad (4)$$



Alternative derivations of these expressions (3) and (4) 
are given in the Appendix. A simple relationship exists 
between the expectations of molecular coancestry and 
molecular covariance: 

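$$E(f_{Mij}) = p^2 + q^2 + 2E(\mathrm{Cov}_{Mij}).$$

From expressions (3) and (4), two different method-of-moments
estimators of $f_{ij}$ can be obtained by reversing
the formulas:

$$\hat{f}_{f_M,ij} = \frac{f_{Mij}}{2pq} - \frac{p^2 + q^2}{2pq} \qquad (5)$$

$$\hat{f}_{\mathrm{Cov}_M,ij} = \frac{\mathrm{Cov}_{Mij}}{pq} \qquad (6)$$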



Expressions (3) and (5) are well known [2], whereas 
(6) does not seem, to our knowledge, to have been used 
previously. 
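As a small numerical check of estimators (5) and (6), the sketch below inverts the expected values for a pair of full sibs (genealogical coancestry 0.25) when p = 0.5. The function names are hypothetical, and both functions assume 0 < p < 1 so that the divisor pq is not zero.

```python
def f_from_molecular_coancestry(f_m: float, p: float) -> float:
    """Estimator (5): invert E(f_M) = p^2 + q^2 + 2 p q f."""
    q = 1.0 - p
    return (f_m - (p * p + q * q)) / (2.0 * p * q)


def f_from_molecular_covariance(cov_m: float, p: float) -> float:
    """Estimator (6): invert E(Cov_M) = p q f."""
    return cov_m / (p * (1.0 - p))


# With p = 0.5 and f = 0.25, expressions (3) and (4) give
# E(f_M) = 0.5 + 2 * 0.25 * 0.25 = 0.625 and E(Cov_M) = 0.25 * 0.25 = 0.0625.
est_from_fm = f_from_molecular_coancestry(0.625, 0.5)
est_from_cov = f_from_molecular_covariance(0.0625, 0.5)
print(est_from_fm, est_from_cov)
```

Both estimators recover 0.25 when fed the expected values; with real marker data they differ from the genealogical value through sampling of loci and through any mismatch between p and the base population frequency.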

Accounting for variation of allelic frequencies 

The above formulas refer to a scenario in which the
base population has one or many independent loci with
a common allelic frequency p. If this is not the case and
p for individual loci is a random variable that has been
sampled from a distribution with mean $\bar{p}$ and variance
Var(p), taking expected values across loci, we obtain:

$$E(f_{Mij}) = E(p^2) + E(q^2) + 2 f_{ij} E(pq)$$

$$E(\mathrm{Cov}_{Mij}) = f_{ij} E(pq).$$

Then, using $\mathrm{Var}(p) = \mathrm{Var}(q) = E(p^2) - \bar{p}^2$ and
$E(pq) = \bar{p}\bar{q} - \mathrm{Var}(p)$, we obtain

$$E(f_{Mij}) = \bar{p}^2 + \bar{q}^2 + 2\mathrm{Var}(p) + 2 f_{ij} \left[ \bar{p}\bar{q} - \mathrm{Var}(p) \right] \qquad (7)$$

$$E(\mathrm{Cov}_{Mij}) = \mathrm{Var}(p) + f_{ij} \left[ \bar{p}\bar{q} - \mathrm{Var}(p) \right]. \qquad (8)$$



The first term involving Var(p) represents a bias that
results from an artificial covariance between individuals
(even between unrelated ones) that is caused by variation
in allele frequencies between loci. Equation (8) is
derived as follows. As shown in the Appendix, the
expectation of the molecular covariance between individuals
i and j for a unique allele frequency p is

$$E(\mathrm{Cov}_{Mij}) = E(g_i g_j) - E(g_i) E(g_j)$$

where $E(g_i g_j) = p^2 + pq f_{ij}$
and $E(g_i) E(g_j) = p^2$.

For random allele frequencies, in addition to averaging
across the sampling distribution of individuals i and j in
the population ($E_{population}$), one has to average also
across allele frequencies ($E_{loci}$), and the expression above
becomes

$$E(\mathrm{Cov}_{Mij}) = E_{loci}\left(E_{population}(g_i g_j)\right) - E_{loci}\left(E_{population}(g_i)\right) E_{loci}\left(E_{population}(g_j)\right) = E_{loci}\left(p_k^2 + p_k q_k f_{ij}\right) - \bar{p}^2,$$

which, after algebra, gives equation (8).
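The algebra can be written out in three lines. This is a sketch of the intermediate steps, assuming, as in the text, that $\bar{p}$ and $\mathrm{Var}(p)$ denote the mean and variance of the allele frequencies across loci and that $\bar{q} = 1 - \bar{p}$.

```latex
\begin{align*}
E(\mathrm{Cov}_{Mij})
  &= E_{loci}\left(p_k^2 + p_k q_k f_{ij}\right) - \bar{p}^2 \\
  &= \left[E(p^2) - \bar{p}^2\right] + f_{ij}\, E(pq) \\
  &= \mathrm{Var}(p) + f_{ij}\left[\bar{p}\bar{q} - \mathrm{Var}(p)\right]
\end{align*}
```

The last step uses $E(p^2) - \bar{p}^2 = \mathrm{Var}(p)$ and $E(pq) = \bar{p} - E(p^2) = \bar{p}\bar{q} - \mathrm{Var}(p)$.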

Therefore, with varying allele frequencies, estimators 
of genealogical coancestry based on equations (5) and 
(6) can be built as 



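$$\hat{f}_{f_M,ij} = \frac{f_{Mij}}{2\left[\bar{p}\bar{q} - \mathrm{Var}(p)\right]} - \frac{\bar{p}^2 + \bar{q}^2 + 2\mathrm{Var}(p)}{2\left[\bar{p}\bar{q} - \mathrm{Var}(p)\right]} \qquad (9)$$

$$\hat{f}_{\mathrm{Cov}_M,ij} = \frac{\mathrm{Cov}_{Mij}}{\bar{p}\bar{q} - \mathrm{Var}(p)} - \frac{\mathrm{Var}(p)}{\bar{p}\bar{q} - \mathrm{Var}(p)} \qquad (10)$$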



These estimators use the same notation as expressions
(5) and (6); whether or not to include variation in allelic
frequencies will depend on the context. Assuming that the allele
frequencies are random draws from a Beta distribution
with parameters $\alpha$ and $\beta$, $\bar{p}$ and Var(p) are $\alpha / (\alpha + \beta)$
and $\alpha\beta / [(\alpha + \beta)^2 (\alpha + \beta + 1)]$, respectively. Thus, to
extrapolate from molecular coancestry or molecular covariance
to genealogical coancestry requires that the distribution
of the base population allele frequencies is
known, or at least that its first and second moments are
known. However, for practical applications, both $\bar{p}$ and
Var(p) can be replaced by their estimates from the current
population, namely



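$$\hat{\bar{p}} = \frac{1}{L} \sum_k p_k \qquad (11)$$

$$\widehat{\mathrm{Var}}(p) = \frac{1}{L} \sum_k \left(p_k - \hat{\bar{p}}\right)^2 \qquad (12)$$

where

$$p_k = \frac{1}{n} \sum_i g_{ik} \qquad (13)$$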



The estimators of Van Raden presented in the next section,
Equations (14) and (15), differ slightly from the combined use
of Equations (7) to (12). In Equations (14) and (15),
individual allele frequencies $g_{ik}$ are centered with reference
to allele frequencies $p_k$, across individuals but
within loci, whereas in Equations (7) to (12), covariances
and coancestries $f_{Mij}$ and $\mathrm{Cov}_{Mij}$ are computed for each
pair of individuals as shown in Equations (1) and (2), i.e.
individual allele frequencies are centered using frequencies
across loci but within individuals: $\bar{g}_i = \frac{1}{L} \sum_k g_{ik}$.

Here, loci are not exchangeable in the same sense as for
equations (7) and (8), because loci with different allele
frequencies in the population will contribute more or
less to the covariances.



Equations (5)-(6) and (9)-(10), using Equations (11) to (13)
when necessary, will be implemented in the
simulations.
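A minimal sketch of estimators (9) and (10), together with the moment estimates of Equations (11) to (13) and the Beta moments given above, could look as follows. All function names are hypothetical; the genotype matrix is assumed to be a list of individuals, each a list of gene frequencies (1, 1/2 or 0) per locus.

```python
from typing import Sequence, Tuple


def beta_moments(alpha: float, beta: float) -> Tuple[float, float]:
    """Mean and variance of allele frequencies drawn from Beta(alpha, beta)."""
    total = alpha + beta
    return alpha / total, alpha * beta / (total ** 2 * (total + 1.0))


def frequency_moments(genotypes: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Equations (11)-(13): mean and variance across loci of p_k, where p_k
    is the average gene frequency of the n individuals at locus k."""
    n = len(genotypes)
    n_loci = len(genotypes[0])
    p_k = [sum(ind[k] for ind in genotypes) / n for k in range(n_loci)]
    p_bar = sum(p_k) / n_loci
    var_p = sum((p - p_bar) ** 2 for p in p_k) / n_loci
    return p_bar, var_p


def f_from_coancestry(f_m: float, p_bar: float, var_p: float) -> float:
    """Estimator (9): invert expression (7)."""
    q_bar = 1.0 - p_bar
    denom = 2.0 * (p_bar * q_bar - var_p)
    return (f_m - (p_bar ** 2 + q_bar ** 2 + 2.0 * var_p)) / denom


def f_from_covariance(cov_m: float, p_bar: float, var_p: float) -> float:
    """Estimator (10): invert expression (8)."""
    return (cov_m - var_p) / (p_bar * (1.0 - p_bar) - var_p)


# Flat Beta(1, 1): mean 1/2 and variance 1/12. For f = 0.25, expressions
# (7) and (8) give E(f_M) = 0.75 and E(Cov_M) = 0.125.
p_bar, var_p = beta_moments(1.0, 1.0)
est_9 = f_from_coancestry(0.75, p_bar, var_p)
est_10 = f_from_covariance(0.125, p_bar, var_p)
print(est_9, est_10)

# Moments estimated from a toy sample of two individuals and three loci.
toy_p_bar, toy_var_p = frequency_moments([[1.0, 0.5, 0.0], [1.0, 0.0, 0.0]])
print(toy_p_bar, toy_var_p)
```

With the Beta(1, 1) moments, the implied intercepts and slopes are -2 and 3 for molecular coancestry and -0.5 and 6 for molecular covariance.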

Van Raden's estimators 

These four methods will be compared by simulation 
with one of the methods proposed by Van Raden [5], 
which can be seen as an implementation of expression 
(6). In the first method proposed by Van Raden, allele
frequencies across individuals were computed (not necessarily
using Equation (11)), and then estimators of molecular
covariance were computed for each locus, summed across loci
and divided by the total molecular variance, as follows:



$$\hat{f}_{VR1,ij} = \frac{\sum_k (g_{ik} - p_k)(g_{jk} - p_k)}{\sum_k p_k (1 - p_k)} \qquad (14)$$



This method corresponds to positing a linear model 
where, for a hypothetical quantitative trait, the genetic 
value of an individual is the sum of independent marker 
effects; overall (i.e., due to the sum of the effects of all loci) 
covariance among individuals in the sample is computed 
first, and then standardized by the overall variance of a 
base population in Hardy-Weinberg equilibrium with
allele frequencies equal to those observed in the sample, to
arrive at additive relationships. In the second method of
Van Raden (later used, for example, in [13]), estimators of 
genealogical coancestry are computed as in Equation (14) 
for each locus and then averaged, as follows: 



$$\hat{f}_{VR2,ij} = \frac{1}{L} \sum_k \frac{(g_{ik} - p_k)(g_{jk} - p_k)}{p_k (1 - p_k)} \qquad (15)$$



The main difference between estimators (14) and (15)
is that less polymorphic loci get more credit in estimator
(15). Note that Equation (15) is undefined for $p_k$
equal to 0 or 1, which is not the case for Equation (14).
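The contrast between the two estimators can be seen in a short sketch. The function names are hypothetical, and the handling of monomorphic loci in the second function (skipping them and dividing by the number of retained loci) is an assumption made for this illustration.

```python
from typing import Sequence


def van_raden_1(gi: Sequence[float], gj: Sequence[float],
                p: Sequence[float]) -> float:
    """Equation (14): sum of cross-products over the sum of p_k (1 - p_k)."""
    num = sum((a - pk) * (b - pk) for a, b, pk in zip(gi, gj, p))
    den = sum(pk * (1.0 - pk) for pk in p)
    return num / den


def van_raden_2(gi: Sequence[float], gj: Sequence[float],
                p: Sequence[float]) -> float:
    """Equation (15): average of per-locus ratios.

    Monomorphic loci (p_k equal to 0 or 1) are skipped, because the ratio
    is undefined there; the average is taken over the retained loci.
    """
    terms = [(a - pk) * (b - pk) / (pk * (1.0 - pk))
             for a, b, pk in zip(gi, gj, p) if 0.0 < pk < 1.0]
    return sum(terms) / len(terms)


# Toy example: gene frequencies of two individuals at three loci and
# assumed allele frequencies p_k for those loci.
gi = [1.0, 0.5, 0.0]
gj = [1.0, 0.0, 0.5]
p = [0.5, 0.25, 0.5]

vr1 = van_raden_1(gi, gj, p)
vr2 = van_raden_2(gi, gj, p)
print(vr1, vr2)
```

In estimator (15) each locus is divided by its own p_k(1 - p_k), so a locus with a low minor allele frequency can dominate the average, which is the source of the extra credit given to less polymorphic loci.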



Simulation 

A population was bred from a base (founder) population 
of 20 individuals. One hundred or 10,000 biallelic loci 
representing single nucleotide polymorphism (SNP) 
markers, distributed over 10 chromosomes, were simu- 
lated. Loci were autosomal, unlinked, neutral, without 
mutation, and followed Mendelian inheritance. In the 
first scenario, at each locus, alleles at the founder popu- 
lation were sampled with a fixed probability value of p = 
0.5. In the second scenario, at each locus, alleles were
sampled with a probability taken from a flat Beta distribution,
Beta(1, 1). Therefore, there was Hardy-Weinberg
equilibrium within loci. Ten discrete generations of 20 
individuals were bred, using random mating with sepa- 
rate sexes, resulting in a data set of 200 individuals. We 
also ran some simulations with linkage between loci but 
the results were not much affected. Thus, we included 
only one example with high linkage with either 100 SNP 
over 1 Morgan or 10,000 SNP over 20 Morgan. 

Relatedness between all pairs of individuals was estimated
from the marker data using each of the four estimators
(5), (6), (9) and (10) described above and those of
Van Raden, (14) and (15). For the second estimator of
Van Raden (15), monomorphic loci were ignored
because for some loci the estimated value $p_k$ may be
one or zero and the estimator becomes undefined. In
addition, relatedness between individuals was calculated 
from the pedigree, using the tabular method [11] and 
this was considered to be the true value; this is true if 
there are many unlinked loci (avoiding noise due to 
finite sampling and co-segregation), which holds in the 
simulation. We also compared results to true IBD prob- 
abilities rather than pedigree coancestries. This is rele- 
vant for real situations where deviations from the 
average relationship exist due to linkage and finite sampling
[14]. To obtain true IBD probabilities, we coded
the alleles in the base population as unique alleles, with
codes 1 through 2n.
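The idea of coding founder alleles uniquely can be sketched with a small gene-dropping routine for unlinked loci. This is an illustration under simplifying assumptions (a fixed toy pedigree rather than the simulated population, labels starting at 0 instead of 1, and each individual having either two known parents or none); the names `gene_drop` and `ibd_coancestry` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Pedigree = List[Tuple[Optional[int], Optional[int]]]
Genotype = List[Tuple[int, int]]


def gene_drop(pedigree: Pedigree, n_loci: int, rng: random.Random) -> List[Genotype]:
    """Drop unique founder alleles through a pedigree at unlinked loci.

    Returns genotypes[i][k] = (paternal allele, maternal allele). Founders
    (both parents None) carry two labels that no other founder carries.
    """
    genotypes: List[Genotype] = []
    next_label = 0
    for sire, dam in pedigree:
        if sire is None and dam is None:
            # Loci act as replicates, so the same labels are used at each locus.
            genotypes.append([(next_label, next_label + 1)] * n_loci)
            next_label += 2
        else:
            # Mendelian sampling: one random allele from each parent per locus.
            genotypes.append([
                (rng.choice(genotypes[sire][k]), rng.choice(genotypes[dam][k]))
                for k in range(n_loci)
            ])
    return genotypes


def ibd_coancestry(gi: Genotype, gj: Genotype) -> float:
    """Probability that random alleles from i and j carry the same label."""
    matches = sum(a == b for x, y in zip(gi, gj) for a in x for b in y)
    return matches / (4 * len(gi))


# Two founders (0 and 1) and two full sibs (2 and 3).
pedigree = [(None, None), (None, None), (0, 1), (0, 1)]
geno = gene_drop(pedigree, 2000, random.Random(1))

parent_offspring = ibd_coancestry(geno[0], geno[2])
full_sibs = ibd_coancestry(geno[2], geno[3])
founder_self = ibd_coancestry(geno[0], geno[0])
print(parent_offspring, full_sibs, founder_self)
```

The parent-offspring value is exactly 0.25 at every locus, whereas the full-sib value only approaches the genealogical coancestry of 0.25 as the number of loci grows, which illustrates the deviations due to finite sampling.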

Real data

To illustrate the procedure on real data, a set of 1,827 
French Holstein bulls genotyped with the Illumina 
Bovine SNP50 BeadChip for 51,325 polymorphic (minor 
allele frequency > 0) SNP was used. The pedigree of 
these bulls was traced back as far as possible, including 
6,940 individuals. We used PEDIG [15] to compute the 
equivalent number of known generations: 4.22, and the 
average number of ancestors: 91.4. Estimators (5), (6),
(9), (10), (14) and (15) were used to compute coancestries
among the genotyped animals. Some of the computations
used the preGSf90 software with methods
detailed in [16].

Results 

The agreement between the molecular coancestries and
molecular covariances and their expected values was
checked by simulation. As for the comparison with IBD
probabilities, the results were almost identical to those 
obtained with genealogical coancestries except for the sce- 
nario with a low number of markers. Table 1 shows the 
regression of the genealogical coancestry on the molecular 
coancestry or the molecular covariance. Very good agree- 
ment exists between expected (in estimators (5), (6), (9) 
and (10)) and observed values of intercept and slope when 
the number of SNP is very large; also, the coefficients of 
determination are close to 1. This occurs in the two con- 
sidered situations (p fixed or p variable among loci). The 
coefficients of determination are low when the number of 
SNP is low, especially when the allele frequencies in the 
base population are variable. 

For the simulated data, we implemented estimators of
the genealogical coancestry based on molecular coancestry
(equations (5) or (9)) and molecular covariance (equations
(6) or (10)), using the true or estimated frequencies. In
both cases (p fixed or random), estimates based on coancestry
and covariance were almost identical and only the
regression features when using $\hat{f}_{f_M}$ are presented in
Table 2. As expected, the estimation works very well if the
number of SNP is high. If it is low, the estimation of the
intercept is biased upwards and the regression coefficient
downwards. When the number of SNP used to estimate
$\hat{f}_{f_M}$ decreases, the covariance between the estimator and the
true value decreases and the regression coefficient also
decreases; the intercept increases as a direct consequence.

When parameters of the true distribution of allele fre- 
quencies in the founder population are not known, we 
replaced them by their estimates according to Equations 
(11) and (12). Table 2 shows that this simple method
works well with respect to the goodness of fit ($R^2$) but
the estimates were biased (and inflated: b < 1) even for
a high number of SNP. Indeed, Van Raden [5] already
pointed out that base allele frequencies should be used 
to recover correct inbreeding coefficients. Table 3 gives 
the same results but for a scenario where loci are linked, 
with 1 (100 SNP) or 20 (10000 SNP) Morgans per 
gamete. Results were very similar to the unlinked situa- 
tion (Table 2), although the estimation improved for the 
small number of markers and worsened for the high 
number of markers. For the situation with linkage, we 
also analyzed what happens if we use IBD instead of the 
genealogical coancestry as the true values (right hand 
side of Table 3). The fit is better for IBD values than for 
genealogical coancestry, especially with a low number of 
markers. 



Table 1 Features of the regression of genealogical coancestry f on molecular coancestry ($f_M$) and molecular covariance ($\mathrm{Cov}_M$)

                              Regression on coancestry                      Regression on covariance
Nb SNP     Nb replicates      a              b             R^2              a              b             R^2

p = 0.50
100        1000               -0.66 (0.03)   1.38 (0.06)   0.69 (0.03)      0.03 (0.00)    2.77 (0.12)   0.69 (0.03)
10000      50                 -0.99 (0.00)   1.99 (0.01)   1.00 (0.01)      0.00 (0.00)    3.98 (0.03)   1.00 (0.01)
Expected values               -1             2                              0              4

$p_i$ ~ Beta(1, 1)
100        1000               -1.01 (0.08)   1.58 (0.10)   0.52 (0.06)      -0.22 (0.04)   3.17 (0.21)   0.52 (0.06)
10000      50                 -1.98 (0.02)   2.97 (0.03)   0.99 (0.02)      -0.50 (0.00)   5.95 (0.06)   0.99 (0.00)
Expected values               -2             3                              -0.5           6

Expected values for p = 0.50: regression on coancestry, $a = -\frac{p^2 + q^2}{2pq} = -1$ and $b = \frac{1}{2pq} = 2$; regression on covariance, $a = 0$ and $b = \frac{1}{pq} = 4$.
Expected values for $p_i$ ~ Beta(1, 1): regression on coancestry, $a = -\frac{\bar{p}^2 + \bar{q}^2 + 2\mathrm{Var}(p)}{2\bar{p}\bar{q} - 2\mathrm{Var}(p)} = -2$ and $b = \frac{1}{2\bar{p}\bar{q} - 2\mathrm{Var}(p)} = 3$; regression on covariance, $a = -\frac{\mathrm{Var}(p)}{\bar{p}\bar{q} - \mathrm{Var}(p)} = -0.5$ and $b = \frac{1}{\bar{p}\bar{q} - \mathrm{Var}(p)} = 6$.

Intercept (a), slope (b) and coefficient of determination ($R^2$), with standard deviations across replicates, of the regression equation of genealogical coancestry f on
molecular coancestry ($f_M$) and molecular covariance ($\mathrm{Cov}_M$), based on simulated data, when the distribution of allele frequencies in the founders (p) is known and
fixed (p = 0.5) or variable ($p_i$ ~ Beta(1, 1)).
Table 2 Features of the regression of genealogical coancestry f on estimators

                              Distribution of allelic       Distribution of allelic frequencies
                              frequencies known             estimated from the data
Nb SNP     Nb replicates      a       b       R^2           a       b       R^2

p = 0.50*
100        1000               0.03    0.69    0.69          0.09    0.63    0.69
10000      50                 0.00    0.99    1.00          0.09    0.91    1.00
Expected values               0.0     1.0                   0.0     1.0

$p_i$ ~ Beta(1, 1)**
100        1000               0.05    0.52    0.53          0.09    0.48    0.52
10000      50                 0.00    0.99    1.00          0.09    0.90    0.99
Expected values               0.0     1.0                   0.0     1.0

Intercept (a), slope (b) and coefficient of determination ($R^2$), based on simulated data, when the distribution of allele frequencies in the founders is known or
estimated from the data.

*Estimators (5) and (6) are used; **Estimators (9) and (10) are used



Results presented in Table 4 show that the Van Raden 
estimator (14) works less well than those proposed here 
based on molecular coancestry or molecular covariance. 
The reason appears to be that inferences about the dis- 
tribution of allele frequencies in the founder population 
are less accurate when based on the average across indi- 
viduals than when based on the average across loci. In 
fact, the results of the Van Raden estimator improve
when the distribution of allele frequencies is estimated
from the data of the last five generations ($R^2$ = 0.69 or
0.96 for 100 and 10000 SNP, respectively) or when the
population simulated comprises four generations of 50
individuals per generation ($R^2$ = 0.53 or 0.96 for 100
and 10000 SNP, respectively). Thus, strong drift exacerbates
the problem. Results from the second estimator of
Van Raden (15) were almost identical to those from 
estimator (14). 



Table 3 Features of the regression of genealogical
coancestry f and identity by descent on estimators

                              Genealogical coancestry       Identity by descent
Nb SNP     Nb replicates      a       b       R^2           a       b       R^2

p = 0.50*
100        1000               0.09    0.55    0.60          0.09    0.68    0.74
10000      50                 0.09    0.87    0.95          0.09    0.91    1.00
Expected values               0.0     1.0                   0.0     1.0

$p_i$ ~ Beta(1, 1)**
100        1000               0.09    0.43    0.48          0.09    0.54    0.58
10000      50                 0.09    0.86    0.95          0.09    0.90    0.99
Expected values               0.0     1.0                   0.0     1.0

Intercept (a), slope (b) and coefficient of determination ($R^2$), based on
simulated data with linkage, when the distribution of allele frequencies in the
founders is estimated from the data

*Estimators (5) and (6) are used; **Estimators (9) and (10) are used



Considering all coancestries among the 1827 bulls in 
the real data set, Table 5 summarizes the comparisons 
among all estimators. The average genealogical coancestry
was 0.04. Estimators (5) and (6) were
severely biased, whereas estimators (9), (10) and (14) were
(slightly) biased in the opposite direction, showing that,
as described by Hayes et al. [17], they effectively set the
current population as the base. We will refer to this later.
Estimators (5) versus (6) and (9) versus (10) showed the 
same bias; estimators (5-9) and (6-10) were perfectly cor- 
related, which is logical because they are linear transfor- 
mations of each other. Only estimator (14) provided a 
variance of coancestries similar to genealogical values, 
although all estimators show higher variances; this can 
also be seen in the simulations because the regression 
coefficients were less than 1. Estimator (15) is unbiased, 
but shows low correlations with all other methods and 
higher variance due to numerical instability caused by 



Table 4 Features of the regression equation of
genealogical coancestry f on the first estimator of Van
Raden

                              Without linkage               With linkage
Nb SNP     Nb replicates      a       b       R^2           a       b       R^2

p = 0.50
100        1000               0.09    0.57    0.36          0.09    0.48    0.30
10000      50                 0.09    0.90    0.57          0.09    0.85    0.53
Expected values               0.0     1.0                   0.0     1.0

$p_i$ ~ Beta(1, 1)
100        1000               0.09    0.52    0.33          0.09    0.44    0.28
10000      50                 0.09    0.90    0.59          0.09    0.90    0.58
Expected values               0.0     1.0                   0.0     1.0

Intercept (a), slope (b) and coefficient of determination ($R^2$) using the first
estimator of Van Raden (expression 14), based on simulated data without and
with linkage, when the distribution of allele frequencies in the founders is
estimated from the data
Table 5 Behaviour of estimators of coancestries (including self-coancestries) using pedigree (f) or molecular data for
1827 Holstein bulls

                           f*      f_M     Cov_M   f^_fM   f^_fM   f^_CovM  f^_CovM  f^_VR1  f^_VR2
                                   (1)     (2)     (5)     (9)     (6)      (10)     (14)    (15)

f*                         0.11    0.59    0.67    0.59    0.59    0.67     0.67     0.87    0.48
f_M (1)                    0.66    0.04    0.76    1       1       0.76     0.76     0.59    0.34
Cov_M (2)                  0.01    -0.70   0.01    0.76    0.76    1        1        0.73    0.41
f^_fM (5)                  0.23    -0.43   0.21    0.19    1       0.76     0.76     0.59    0.34
f^_fM (9)                  -0.05   -0.71   -0.06   -0.27   0.37    0.76     0.76     0.59    0.34
f^_CovM (6)                0.23    -0.43   0.22    0       0.27    0.13     1        0.73    0.41
f^_CovM (10)               -0.04   -0.70   -0.06   -0.27   0       -0.27    0.24     0.73    0.41
f^_VR1 (14)                -0.04   -0.70   -0.01   -0.27   0       -0.27    0        0.13    0.58
f^_VR2 (15)                -0.04   -0.70   -0.05   -0.27   0       -0.27    0        0       0.32

Correlations (upper triangle), variances (diagonal; divided by 100) and average differences (lower triangle; row estimator minus column estimator) between the
different estimators. Here f^ denotes an estimator of genealogical coancestry.

*f is the genealogical coancestry calculated by the tabular method; for the other estimators, the corresponding formula in the text is indicated in parentheses



low minor allele frequencies. Estimator (14) is an ade- 
quate estimator with regard to closeness to genealogical 
coancestries. 

Discussion 

Genetic marker data are widely used to estimate the 
relatedness between individuals. Such marker-based 
relatedness is valuable in many areas of research on the 
evolution and conservation of natural populations, for 
example for estimating heritabilities, estimating popula- 
tion sizes, minimizing inbreeding in captive populations, 
and studying social structures and patterns of mating. 

Since the 1950s, many relatedness estimators have been
proposed. However, in recent years, the use of high-density
SNP genotypes in 'genomic selection' has prompted
the need for a genomic coancestry matrix [5,17] that is more
accurate than the pedigree-based one, because true coancestry
will be affected by linkage and finite sampling [14],
and also because pedigree-based genealogical coancestry is 
obliged to assume an average relationship among founders 
(usually 0). Van Raden [5,18] has proposed the use of 
molecular covariance to derive (more exact) genealogical 
coancestries. Because of its simplicity and computational 
efficiency, the use of molecular covariances has quickly 
become widespread [13,7], although its origin is often 
erroneously attributed [7,19]. In fact, the earliest reference
to its use that we are aware of is [18]. Here, we have recalled
Cockerham's original derivation [8] and have provided an
equivalent derivation. This provides further proof for the
prediction methods of gene content of non-genotyped animals
through pedigree relationships [20,21], which, in
turn, are the basis for the single-step method to combine 
genomic and pedigree relationships [22,23]. 



We have also shown that, if we know the true distri- 
bution of the allelic frequencies in the founder popula- 
tion, it is possible to obtain very accurate estimates of 
genealogical coancestries from either molecular coances- 
tries or covariances if the number of markers is high. 
Even if allelic frequencies in the base population are 
unknown, and the results are severely biased, a high cor- 
relation between the estimated and the true genealogical 
values is maintained. 

In principle, it is possible to infer founder frequencies 
using either genealogical or marker-based relationships, 
possibly iteratively [7,21,24]. However, this is usually 
quite inaccurate and results in estimators that are very 
similar to crude population frequencies. In addition, a 
question remains on what is the ideal base population, 
which is unsolvable if no pedigree is known. In fact,
using allelic frequencies in the observed population
(crude means) is equivalent to defining a population
with the same gene frequencies as the observed population
as the base generation, but with genotypic frequencies
in Hardy-Weinberg equilibrium [17]. To change the
base population, a correction based on Wright's $F_{ST}$ coefficients
has recently been suggested [25].

In practice, the computed matrix of coancestries (G) is 
used for two purposes. One purpose is the estimation of 
breeding values based on marker genotype data. In this 
case, if no other information is used (i.e., there is no use of 
pedigree-based relationships A), adding or removing con- 
stants from G is equivalent to fitting an overall (random) 
mean to the model for genetic values. Thus, estimates of 
breeding values will be simply shifted by a constant but 
their contrasts and selection decisions will be unaffected. 
In this case, the variance components need to be estimated
with the same G and will be inflated. However, if mixing
of A and G is needed, as in the single-step procedure
[23,26], then the two matrices need to be compatible. In
this case, bias due to selection can be a problem. A
correction based on $F_{ST}$, suggested by Powell et al. [25], has
recently been proposed for the single-step method and has been
shown to increase accuracy and remove bias of genetic
evaluations [27,28]. This correction works, roughly, by fixing
the biases and variances of the estimator of coancestry
that can be observed, for instance, in Table 5.

For conservation purposes, most strategies work by 
minimizing the average coancestry [29], which can be 
expressed as a quadratic form x'Gx. The optimization of 
this expression is invariant to the addition (or multipli- 
cation) of any constant to G, unless more than one 
population is considered. If the G matrices are com- 
puted separately for each population, then they will not 
be compatible. If pooled current frequencies are used, 
then the more variable or more abundant population 
will be favored. Possibly, in this case, a clear definition 
of the allele frequencies (and thus the base population) 
to compute coancestries is needed. 
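The invariance follows because, for contributions that sum to one, x'(G + c11')x = x'Gx + c. The following is a minimal numerical sketch of this point, assuming that the only constraint is that contributions sum to one (practical strategies also impose non-negativity and other restrictions). The helper names `solve` and `min_coancestry_contributions` and the 3 x 3 matrix are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction as Fr


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[Fr(v) for v in row] + [Fr(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        # Bring a row with a non-zero entry in column c to the pivot position.
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


def min_coancestry_contributions(G):
    """Contributions x minimising x'Gx subject to sum(x) == 1.

    The stationary point is proportional to G^{-1} 1 (sign constraints ignored).
    """
    w = solve(G, [1] * len(G))
    total = sum(w)
    return [wi / total for wi in w]


# Hypothetical positive definite coancestry matrix for three candidates.
G = [
    [Fr(1, 2), Fr(1, 4), Fr(1, 8)],
    [Fr(1, 4), Fr(1, 2), Fr(1, 4)],
    [Fr(1, 8), Fr(1, 4), Fr(5, 8)],
]
c = Fr(3, 10)
G_shifted = [[g + c for g in row] for row in G]        # G + c 11'
G_scaled = [[Fr(5, 2) * g for g in row] for row in G]  # 2.5 G

x = min_coancestry_contributions(G)
x_shifted = min_coancestry_contributions(G_shifted)
x_scaled = min_coancestry_contributions(G_scaled)
print(x == x_shifted == x_scaled)
```

Exact fractions are used so that the three solutions can be compared for equality rather than up to rounding error.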

In addition, the real data example shows that, in this
data set, estimators based on molecular covariances, in
particular estimator (14), are more similar to and more
compatible with those based on pedigree than estimators
based on coancestries. We do not recommend estimator
(15) because it does not agree well with genealogical
coancestries, its distribution of coancestries has more
variance, and it is unstable for minor allelic frequencies
close to 0 and undefined for monomorphic loci.
Unfortunately, this estimator is recommended by some
authors [7,13,19].
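The instability at low allelic frequencies can be illustrated with a small sketch. The exact forms of estimators (14) and (15) are given earlier in the paper and are not reproduced here; the code below only assumes two generic ways of pooling loci, a ratio of sums and an average of per-locus ratios, with the per-locus molecular covariance taken as (g_i - p)(g_j - p) and gene content g coded 0, 0.5 or 1. The function names and the three-locus data are hypothetical.

```python
def molecular_covariances(g_i, g_j, p):
    """Per-locus products (g_i - p)(g_j - p), with gene content g in {0, 0.5, 1}."""
    return [(a - f) * (b - f) for a, b, f in zip(g_i, g_j, p)]


def ratio_of_sums(g_i, g_j, p):
    """Pool loci first: sum of covariances divided by sum of p(1 - p)."""
    cov = molecular_covariances(g_i, g_j, p)
    return sum(cov) / sum(f * (1 - f) for f in p)


def average_of_ratios(g_i, g_j, p):
    """Estimate per locus, then average: mean of covariance / (p(1 - p))."""
    cov = molecular_covariances(g_i, g_j, p)
    return sum(c / (f * (1 - f)) for c, f in zip(cov, p)) / len(p)


# Three loci; at the last one both individuals carry a rare allele.
p = [0.5, 0.3, 0.001]
g_i = [0.5, 1.0, 0.5]
g_j = [1.0, 0.5, 0.5]

print(ratio_of_sums(g_i, g_j, p))      # stays of order 1
print(average_of_ratios(g_i, g_j, p))  # dominated by the rare-allele locus

# A monomorphic locus (p = 0) contributes 0/0 to the average of ratios.
try:
    average_of_ratios([0.0], [0.0], [0.0])
except ZeroDivisionError:
    print("undefined for a monomorphic locus")
```

In the ratio of sums, a monomorphic locus adds zero to both numerator and denominator, so it simply drops out instead of producing an undefined term.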

Conclusions 

The rationale to compare and estimate genealogical 
coancestries based on molecular empirical coancestries 
or covariances has been shown for any outbred or inbred 
population, and different estimators have been developed 
which account for variation in allele frequencies between 
loci. In practice, different estimators lead to similar
conclusions. Estimators are easy to construct but suffer
from a lack of knowledge of the distribution of allele
frequencies in the base population. This is, however, not
a problem for most practical applications.

Appendix 

We present here a formal derivation of the relationship
between genealogical and molecular coancestries and
covariances. This derivation is an alternative to that of
Cockerham [8] and, to our knowledge, has not been
shown before. We prove it for a population of
outbred individuals and sketch the proof for a
population of inbred individuals.



Outbred individuals 

There are three ways in which a pair of relatives can
share genes identical by descent (IBD) (Crow and
Kimura; Figure 1); $k_0$, $2k_1$ and $k_2$ are the probabilities
that the two individuals share no genes, just one gene and
both genes IBD, respectively ($k_0 + 2k_1 + k_2 = 1$). The
coancestry coefficient between two individuals i and j is
thus defined as:

$$f_{ij} = \frac{2k_1}{4} + \frac{k_2}{2}.$$

The joint genotypic distribution of non-inbred relatives
i and j is well known (see for example [30]), as
shown in Table 6. The expected value of the molecular
coancestry averaged over the nine rows will be

$$E(f_{Mij}) = \sum f_M \times \text{frequency}.$$

After some algebra,

$$E(f_{Mij}) = p^2 + q^2 + 2pq\left[\frac{2k_1}{4} + \frac{k_2}{2}\right] = p^2 + q^2 + 2pq f_{ij}.$$

The expected value of the molecular covariance averaged
over the nine rows will be, given that $E(g_i) = E(g_j) = p$,

$$E(\mathrm{Cov}_{Mij}) = E(g_i g_j) - E(g_i) E(g_j)$$

$$= (k_0 + 2k_1 + k_2) p^2 + \frac{2k_1}{4} pq + \frac{k_2}{2} pq - p^2$$

$$= pq f_{ij}.$$
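The "some algebra" can be checked numerically. The sketch below enumerates the nine ordered genotype pairs of two non-inbred relatives, using the standard joint frequencies written in the notation of the text ($k_0$, $2k_1$, $k_2$), and compares both expectations with their closed forms for a few common relationships. The function names, the allele frequency and the chosen relationships are illustrative assumptions.

```python
from fractions import Fraction as Fr


def joint_distribution(p, k0, k1, k2):
    """Rows (f_M, g_i, g_j, frequency) for the nine ordered genotype pairs.

    k0, 2*k1 and k2 are the probabilities of sharing 0, 1 and 2 genes IBD,
    so that k0 + 2*k1 + k2 == 1. Gene content g is coded 1 (AA), 1/2 (Aa), 0 (aa).
    """
    q = 1 - p
    s = 2 * k1  # probability of sharing exactly one gene IBD
    h = Fr(1, 2)
    return [
        (Fr(1), Fr(1), Fr(1), k0 * p**4 + s * p**3 + k2 * p**2),        # AA, AA
        (h, Fr(1), h, 2 * k0 * p**3 * q + s * p**2 * q),                # AA, Aa
        (h, h, Fr(1), 2 * k0 * p**3 * q + s * p**2 * q),                # Aa, AA
        (Fr(0), Fr(1), Fr(0), k0 * p**2 * q**2),                        # AA, aa
        (Fr(0), Fr(0), Fr(1), k0 * p**2 * q**2),                        # aa, AA
        (h, h, h, 4 * k0 * p**2 * q**2 + s * p * q + 2 * k2 * p * q),   # Aa, Aa
        (h, h, Fr(0), 2 * k0 * p * q**3 + s * p * q**2),                # Aa, aa
        (h, Fr(0), h, 2 * k0 * p * q**3 + s * p * q**2),                # aa, Aa
        (Fr(1), Fr(0), Fr(0), k0 * q**4 + s * q**3 + k2 * q**2),        # aa, aa
    ]


def expectations(p, k0, k1, k2):
    """Return (sum of frequencies, E(f_M), E(Cov_M)) over the nine rows."""
    rows = joint_distribution(p, k0, k1, k2)
    total = sum(fr for _, _, _, fr in rows)
    e_fm = sum(fm * fr for fm, _, _, fr in rows)
    e_gi = sum(gi * fr for _, gi, _, fr in rows)
    e_gj = sum(gj * fr for _, _, gj, fr in rows)
    e_gigj = sum(gi * gj * fr for _, gi, gj, fr in rows)
    return total, e_fm, e_gigj - e_gi * e_gj


# (k0, k1, k2) for a few relationships, with k0 + 2*k1 + k2 = 1.
relationships = {
    "unrelated": (Fr(1), Fr(0), Fr(0)),
    "half sibs": (Fr(1, 2), Fr(1, 4), Fr(0)),
    "parent-offspring": (Fr(0), Fr(1, 2), Fr(0)),
    "full sibs": (Fr(1, 4), Fr(1, 4), Fr(1, 4)),
}

p = Fr(3, 10)
q = 1 - p
for name, (k0, k1, k2) in relationships.items():
    f = 2 * k1 / 4 + k2 / 2  # genealogical coancestry
    total, e_fm, e_cov = expectations(p, k0, k1, k2)
    assert total == 1
    assert e_fm == p**2 + q**2 + 2 * p * q * f
    assert e_cov == p * q * f
    print(name, f, e_fm, e_cov)
```

Exact fractions are used so that the enumerated expectations can be compared with the closed forms for equality.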



Inbred individuals 

When either relative may be inbred, we need nine ways 
in which a pair of relatives can share genes identical by 
descent [31] (Figure 2). The following relationships hold: 



$$k_0^{00} + 2k_1^{00} + k_2^{00} + k_0^{10} + k_0^{01} + k_0^{11} + 2k_1^{10} + 2k_1^{01} + k_1^{11} = 1$$

$$F_i = k_0^{10} + 2k_1^{10} + k_0^{11} + k_1^{11}$$

$$F_j = k_0^{01} + 2k_1^{01} + k_0^{11} + k_1^{11}$$

$$f_{ij} = \frac{1}{2} k_1^{00} + \frac{1}{2} k_2^{00} + k_1^{10} + k_1^{01} + k_1^{11}$$

Figure 1 Three modes of genetic identity-by-descent between
two outbred individuals at a single locus (modes labelled $k_0$, $2k_1$ and $k_2$)

Table 6 Joint genotypic distribution of non-inbred relatives i and j

| Genotype of i | Genotype of j | $f_M$ | $g_i$ | $g_j$ | Frequency |
|---|---|---|---|---|---|
| AA | AA | 1 | 1 | 1 | $k_0 p^4 + 2k_1 p^3 + k_2 p^2$ |
| AA | Aa | 0.5 | 1 | 0.5 | $2k_0 p^3 q + 2k_1 p^2 q$ |
| Aa | AA | 0.5 | 0.5 | 1 | $2k_0 p^3 q + 2k_1 p^2 q$ |
| AA | aa | 0 | 1 | 0 | $k_0 p^2 q^2$ |
| aa | AA | 0 | 0 | 1 | $k_0 p^2 q^2$ |
| Aa | Aa | 0.5 | 0.5 | 0.5 | $4k_0 p^2 q^2 + 2k_1 pq + 2k_2 pq$ |
| Aa | aa | 0.5 | 0.5 | 0 | $2k_0 p q^3 + 2k_1 p q^2$ |
| aa | Aa | 0.5 | 0 | 0.5 | $2k_0 p q^3 + 2k_1 p q^2$ |
| aa | aa | 1 | 0 | 0 | $k_0 q^4 + 2k_1 q^3 + k_2 q^2$ |

Figure 2 Nine ways in which a pair of relatives can share genes identical by descent

Table 7 Joint genotypic distribution of inbred relatives i and j

The joint genotypic distribution of relatives i and j
when either relative may be inbred is also well known
(Table 7). It requires the nine ways in which a pair of
relatives can share genes identical by descent and the
corresponding k-coefficients.

After some algebra, we arrive at the same expressions as
above for $E(f_{Mij})$ and $E(\mathrm{Cov}_{Mij})$. Note that the
proof of Cockerham [8] is general and applies to either
outbred or inbred populations.